Introduction#

This notebook contains exploratory data analysis, feature engineering and modelling for the M5 Forecasting competition dataset.


  1. Fetch the data
  2. Downcasting
  3. Melting the data
  4. Exploratory Data Analysis
  5. Feature Engineering
  6. Modelling and Prediction
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sns
import gc
import warnings
warnings.filterwarnings('ignore')
from lightgbm import LGBMRegressor
import joblib

1. Fetch the data#

SALES_PATH = '/ssd003/projects/forecasting_bootcamp/bootcamp_datasets/m5-forecasting-accuracy/sales_train_evaluation.csv'
CALENDAR_PATH = '/ssd003/projects/forecasting_bootcamp/bootcamp_datasets/m5-forecasting-accuracy/calendar.csv'
PRICES_PATH = '/ssd003/projects/forecasting_bootcamp/bootcamp_datasets/m5-forecasting-accuracy/sell_prices.csv'
sales = pd.read_csv(SALES_PATH)
sales.name = 'sales'
calendar = pd.read_csv(CALENDAR_PATH)
calendar.name = 'calendar'
prices = pd.read_csv(PRICES_PATH)
prices.name = 'prices'

Since the validation data for days 1914-1941 is now available, add zero sales for the test days d_1942 - d_1969:

#Add zero sales for the remaining days 1942-1969
for d in range(1942,1970):
    col = 'd_' + str(d)
    sales[col] = 0
    sales[col] = sales[col].astype(np.int16)

2. Downcasting#

In this section I’ll be downcasting the dataframes to reduce the amount of storage they use and also to expedite the operations performed on them.

  • Numerical Columns: Depending on your environment, pandas creates int32, int64, float32 or float64 columns for numeric data by default. If you know the min and max values of a column, you can use a smaller subtype that consumes less memory, and an unsigned subtype if the column has no negative values.
    Here are the different subtypes you can use:
    int8 / uint8: consumes 1 byte of memory, range -128 to 127 or 0 to 255
    bool: consumes 1 byte, True or False
    int16 / uint16: consumes 2 bytes of memory, range -32768 to 32767 or 0 to 65535 (float16 also consumes 2 bytes)
    int32 / uint32 / float32: consumes 4 bytes of memory, int32 range -2147483648 to 2147483647
    int64 / uint64 / float64: consumes 8 bytes of memory
    If one of your columns has values between 1 and 10, for example, you can reduce that column from 8 bytes per row to 1 byte, a saving of more than 85% of the memory used by that column!

  • Categorical Columns: Pandas stores text columns as objects. One of the reasons this storage is not optimal is that it creates a list of pointers to the memory address of each value in the column. For columns with low cardinality (the number of unique values is lower than 50% of the total count), this can be optimized by forcing pandas to use a virtual mapping table where each unique value is mapped to an integer instead of a pointer. This is done with the category datatype.
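As a quick sanity check of these claims (toy data, not the notebook's dataframes), the saving from a smaller integer subtype and from the category dtype can be measured directly with pandas' own memory_usage accounting:

```python
import numpy as np
import pandas as pd

# Integer column whose values fit in int8 (range 1-10)
values = np.random.randint(1, 11, size=100_000)
as_int64 = pd.Series(values, dtype=np.int64)
as_int8 = pd.Series(values, dtype=np.int8)
print(as_int64.memory_usage(deep=True))  # 8 bytes per row (plus the index)
print(as_int8.memory_usage(deep=True))   # 1 byte per row (plus the index)

# Low-cardinality string column: object pointers vs. category codes
states = pd.Series(np.random.choice(['CA', 'TX', 'WI'], size=100_000))
print(states.memory_usage(deep=True))                     # pointers + string storage
print(states.astype('category').memory_usage(deep=True))  # 1-byte codes + 3 labels
```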

sales_bd = np.round(sales.memory_usage().sum()/(1024*1024),1)
calendar_bd = np.round(calendar.memory_usage().sum()/(1024*1024),1)
prices_bd = np.round(prices.memory_usage().sum()/(1024*1024),1)
#Downcast in order to save memory
def downcast(df):
    cols = df.dtypes.index.tolist()
    types = df.dtypes.values.tolist()
    for i,t in enumerate(types):
        if 'int' in str(t):
            if df[cols[i]].min() > np.iinfo(np.int8).min and df[cols[i]].max() < np.iinfo(np.int8).max:
                df[cols[i]] = df[cols[i]].astype(np.int8)
            elif df[cols[i]].min() > np.iinfo(np.int16).min and df[cols[i]].max() < np.iinfo(np.int16).max:
                df[cols[i]] = df[cols[i]].astype(np.int16)
            elif df[cols[i]].min() > np.iinfo(np.int32).min and df[cols[i]].max() < np.iinfo(np.int32).max:
                df[cols[i]] = df[cols[i]].astype(np.int32)
            else:
                df[cols[i]] = df[cols[i]].astype(np.int64)
        elif 'float' in str(t):
            if df[cols[i]].min() > np.finfo(np.float16).min and df[cols[i]].max() < np.finfo(np.float16).max:
                df[cols[i]] = df[cols[i]].astype(np.float16)
            elif df[cols[i]].min() > np.finfo(np.float32).min and df[cols[i]].max() < np.finfo(np.float32).max:
                df[cols[i]] = df[cols[i]].astype(np.float32)
            else:
                df[cols[i]] = df[cols[i]].astype(np.float64)
        elif t == object:  #np.object was removed in newer NumPy versions
            if cols[i] == 'date':
                df[cols[i]] = pd.to_datetime(df[cols[i]], format='%Y-%m-%d')
            else:
                df[cols[i]] = df[cols[i]].astype('category')
    return df  

sales = downcast(sales)
prices = downcast(prices)
calendar = downcast(calendar)
sales_ad = np.round(sales.memory_usage().sum()/(1024*1024),1)
calendar_ad = np.round(calendar.memory_usage().sum()/(1024*1024),1)
prices_ad = np.round(prices.memory_usage().sum()/(1024*1024),1)

The plot below shows how much effect downcasting has had on the memory usage of the DataFrames. Clearly, we have reduced sales & prices to less than a quarter of their original memory usage. calendar is already a small dataframe.

dic = {'DataFrame':['sales','calendar','prices'],
       'Before downcasting':[sales_bd,calendar_bd,prices_bd],
       'After downcasting':[sales_ad,calendar_ad,prices_ad]}

memory = pd.DataFrame(dic)
memory = pd.melt(memory, id_vars='DataFrame', var_name='Status', value_name='Memory (MB)')
memory.sort_values('Memory (MB)',inplace=True)
fig = px.bar(memory, x='DataFrame', y='Memory (MB)', color='Status', barmode='group', text='Memory (MB)')
fig.update_traces(texttemplate='%{text} MB', textposition='outside')
fig.update_layout(template='seaborn', title='Effect of Downcasting')
fig.show()

3. Melting the data#

Currently, the data is in three dataframes: sales, prices & calendar. The sales dataframe contains daily sales data with the days (d_1 - d_1969) as columns. The prices dataframe contains item price details, and calendar contains data about the days d.

3.1 Convert from wide to long format

Here's an example of converting a wide dataframe to a long one.
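A minimal sketch with toy data (hypothetical items and values, not the M5 columns): each day column of the wide frame becomes a set of rows in the long frame.

```python
import pandas as pd

# Toy wide frame: one row per item, one column per day
wide = pd.DataFrame({
    'item': ['A', 'B'],
    'd_1': [3, 5],
    'd_2': [4, 6],
})
# Melt: day columns become a single 'd' column, values become 'sold'
long = pd.melt(wide, id_vars='item', var_name='d', value_name='sold')
print(long)
#   item    d  sold
# 0    A  d_1     3
# 1    B  d_1     5
# 2    A  d_2     4
# 3    B  d_2     6
```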

In this case, the melt function converts the sales dataframe from wide format to long format. I have kept id, item_id, dept_id, cat_id, store_id and state_id as id variables; together they have 30490 unique combinations. Since we have data for 1969 days, the melted dataframe will have 30490 × 1969 = 60034810 rows.

df = pd.melt(sales, id_vars=['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'], var_name='d', value_name='sold').dropna()

3.2 Combine the data

Combine price data from prices dataframe and days data from calendar dataset.
df = pd.merge(df, calendar, on='d', how='left')
df = pd.merge(df, prices, on=['store_id','item_id','wm_yr_wk'], how='left') 

Now I have a single and a complete dataframe with all the data required. Let's Explore!

4. Exploratory Data Analysis#

4.1 The Dataset

The M5 dataset, generously made available by Walmart, involves the unit sales of various products sold in the USA, organized in the form of grouped time series. More specifically, the dataset involves the unit sales of 3,049 products, classified in 3 product categories (Hobbies, Foods, and Household) and 7 product departments, in which the above-mentioned categories are disaggregated. The products are sold across ten stores, located in three States (CA, TX, and WI). I have drawn an interactive visualization showing the distribution of 3049 items across different aggregation levels.
group = sales.groupby(['state_id','store_id','cat_id','dept_id'],as_index=False)['item_id'].count().dropna()
group['USA'] = 'United States of America'
group.rename(columns={'state_id':'State','store_id':'Store','cat_id':'Category','dept_id':'Department','item_id':'Count'},inplace=True)
group = group[group["Count"] != 0]
fig = px.treemap(group, path=['USA', 'State', 'Store', 'Category', 'Department'], values='Count',
                  color='Count',
                  color_continuous_scale= px.colors.sequential.Sunset,
                  title='Walmart: Distribution of items')
fig.update_layout(template='seaborn')
fig.show()

4.2 Item Prices

Here I'll be studying the item prices and their distribution. Please note that prices vary weekly, so to study the distribution of prices I have taken their average.
group_price_store = df.groupby(['state_id','store_id','item_id'],as_index=False)['sell_price'].mean().dropna()
fig = px.violin(group_price_store, x='store_id', color='state_id', y='sell_price',box=True, hover_name='item_id')
fig.update_xaxes(title_text='Store')
fig.update_yaxes(title_text='Selling Price($)')
fig.update_layout(template='seaborn',title='Distribution of Items prices wrt Stores',legend_title_text='State')
fig.show()

Some observations from the above plot:

  • The distribution of item prices is almost uniform across the stores in California, Texas and Wisconsin.
  • Item HOBBIES_1_361, priced at around 30.5 dollars, is the costliest item sold at Walmart stores across California.
  • Item HOUSEHOLD_1_060, priced at around 29.875 dollars, is the costliest item sold at Walmart stores across Texas.
  • Item HOBBIES_1_361, priced at around 30.48 dollars, is the costliest item sold at TX_1 and TX_3 in Texas, while item HOBBIES_1_255, priced at around 30.5 dollars, is the costliest at TX_2.
group_price_cat = df.groupby(['store_id','cat_id','item_id'],as_index=False)['sell_price'].mean().dropna()
fig = px.violin(group_price_cat, x='store_id', color='cat_id', y='sell_price',box=True, hover_name='item_id')
fig.update_xaxes(title_text='Store')
fig.update_yaxes(title_text='Selling Price($)')
fig.update_layout(template='seaborn',title='Distribution of Items prices wrt Stores across Categories',
                 legend_title_text='Category')
fig.show()

As can be seen from the plot above, food category items are quite cheap as compared with hobbies and household items. Hobbies and household items have almost the same price range.

4.3 Items Sold

Let's study the sales across all the stores.
group = df.groupby(['year','date','state_id','store_id'], as_index=False)['sold'].sum().dropna()
fig = px.violin(group, x='store_id', color='state_id', y='sold',box=True)
fig.update_xaxes(title_text='Store')
fig.update_yaxes(title_text='Total items sold')
fig.update_layout(template='seaborn',title='Distribution of Items sold wrt Stores',legend_title_text='State')
fig.show()

Some observations from the above plot:

  • California: CA_3 has sold the most number of items while, CA_4 has sold the least number of items.
  • Texas: TX_2 and TX_3 have sold the maximum number of items. TX_1 has sold the least number of items.
  • Wisconsin: WI_2 has sold the maximum number of items while, WI_3 has sold the least number of items.
  • USA: CA_3 has sold the most number of items while, CA_4 has sold the least number of items.

Let’s study number of items sold over time across all the stores.

fig = go.Figure()
title = 'Items sold over time'
years = group.year.unique().tolist()
buttons = []
y=3
for state in group.state_id.unique().tolist():
    group_state = group[group['state_id']==state]
    for store in group_state.store_id.unique().tolist():
        group_state_store = group_state[group_state['store_id']==store]
        fig.add_trace(go.Scatter(name=store, x=group_state_store['date'], y=group_state_store['sold'], showlegend=True, 
                                   yaxis='y'+str(y) if y!=1 else 'y'))
    y-=1

fig.update_layout(
        xaxis=dict(
        #autorange=True,
        range = ['2011-01-29','2016-05-22'],
        rangeselector=dict(
            buttons=list([
                dict(count=1,
                     label="1m",
                     step="month",
                     stepmode="backward"),
                dict(count=6,
                     label="6m",
                     step="month",
                     stepmode="backward"),
                dict(count=1,
                     label="YTD",
                     step="year",
                     stepmode="todate"),
                dict(count=1,
                     label="1y",
                     step="year",
                     stepmode="backward"),
                dict(count=2,
                     label="2y",
                     step="year",
                     stepmode="backward"),
                dict(count=3,
                     label="3y",
                     step="year",
                     stepmode="backward"),
                dict(count=4,
                     label="4y",
                     step="year",
                     stepmode="backward"),
                dict(step="all")
            ])
        ),
        rangeslider=dict(
            autorange=True,
        ),
        type="date"
    ),
    yaxis=dict(
        anchor="x",
        autorange=True,
        domain=[0, 0.33],
        mirror=True,
        showline=True,
        side="left",
        tickfont={"size":10},
        tickmode="auto",
        ticks="",
        title='WI',
        titlefont={"size":20},
        type="linear",
        zeroline=False
    ),
    yaxis2=dict(
        anchor="x",
        autorange=True,
        domain=[0.33, 0.66],
        mirror=True,
        showline=True,
        side="left",
        tickfont={"size":10},
        tickmode="auto",
        ticks="",
        title = 'TX',
        titlefont={"size":20},
        type="linear",
        zeroline=False
    ),
    yaxis3=dict(
        anchor="x",
        autorange=True,
        domain=[0.66, 1],
        mirror=True,
        showline=True,
        side="left",
        tickfont={"size":10},
        tickmode="auto",
        ticks='',
        title="CA",
        titlefont={"size":20},
        type="linear",
        zeroline=False
    )
    )
fig.update_layout(template='seaborn', title=title)
fig.show()

4.4 State wise Analysis

In this section, I will study the sales and revenue of each store individually across the three states: California, Texas & Wisconsin. I have plotted three plots for each store (CA_1, CA_2, CA_3, CA_4, TX_1, TX_2, TX_3, WI_1, WI_2 & WI_3):

  • The first plot shows the daily sales of a store, with SNAP days plotted separately. Since SNAP promotes food purchases, I have also plotted food sales to check whether it really affects them.
  • The second plot shows the daily revenue of a store, again with SNAP days plotted separately.
  • The third is a heatmap of daily sales, laid out so that day-wise values are easy to see.
df['revenue'] = df['sold']*df['sell_price'].astype(np.float32)
def introduce_nulls(df):
    idx = pd.date_range(df.date.dt.date.min(), df.date.dt.date.max())
    df = df.set_index('date')
    df = df.reindex(idx)
    df.reset_index(inplace=True)
    df.rename(columns={'index':'date'},inplace=True)
    return df

def plot_metric(df,state,store,metric):
    store_sales = df[(df['state_id']==state)&(df['store_id']==store)&(df['date']<='2016-05-22')]
    food_sales = store_sales[store_sales['cat_id']=='FOODS']
    store_sales = store_sales.groupby(['date','snap_'+state],as_index=False)[['sold','revenue']].sum()
    snap_sales = store_sales[store_sales['snap_'+state]==1]
    non_snap_sales = store_sales[store_sales['snap_'+state]==0]
    food_sales = food_sales.groupby(['date','snap_'+state],as_index=False)[['sold','revenue']].sum()
    snap_foods = food_sales[food_sales['snap_'+state]==1]
    non_snap_foods = food_sales[food_sales['snap_'+state]==0]
    non_snap_sales = introduce_nulls(non_snap_sales)
    snap_sales = introduce_nulls(snap_sales)
    non_snap_foods = introduce_nulls(non_snap_foods)
    snap_foods = introduce_nulls(snap_foods)
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=non_snap_sales['date'],y=non_snap_sales[metric],
                           name='Total '+metric+'(Non-SNAP)'))
    fig.add_trace(go.Scatter(x=snap_sales['date'],y=snap_sales[metric],
                           name='Total '+metric+'(SNAP)'))
    fig.add_trace(go.Scatter(x=non_snap_foods['date'],y=non_snap_foods[metric],
                           name='Food '+metric+'(Non-SNAP)'))
    fig.add_trace(go.Scatter(x=snap_foods['date'],y=snap_foods[metric],
                           name='Food '+metric+'(SNAP)'))
    fig.update_yaxes(title_text='Total items sold' if metric=='sold' else 'Total revenue($)')
    fig.update_layout(template='seaborn',title=store)
    fig.update_layout(
        xaxis=dict(
        #autorange=True,
        range = ['2011-01-29','2016-05-22'],
        rangeselector=dict(
            buttons=list([
                dict(count=1,
                     label="1m",
                     step="month",
                     stepmode="backward"),
                dict(count=6,
                     label="6m",
                     step="month",
                     stepmode="backward"),
                dict(count=1,
                     label="YTD",
                     step="year",
                     stepmode="todate"),
                dict(count=1,
                     label="1y",
                     step="year",
                     stepmode="backward"),
                dict(count=2,
                     label="2y",
                     step="year",
                     stepmode="backward"),
                dict(count=3,
                     label="3y",
                     step="year",
                     stepmode="backward"),
                dict(count=4,
                     label="4y",
                     step="year",
                     stepmode="backward"),
                dict(step="all")
            ])
        ),
        rangeslider=dict(
            autorange=True,
        ),
        type="date"
    ))
    return fig
cal_data = group.copy()
cal_data = cal_data[cal_data.date <= '2016-05-22']
cal_data['week'] = cal_data.date.dt.isocalendar().week  #weekofyear is deprecated
cal_data['day_name'] = cal_data.date.dt.day_name()
def calmap(cal_data, state, store, scale):
    cal_data = cal_data[(cal_data['state_id']==state)&(cal_data['store_id']==store)]
    years = cal_data.year.unique().tolist()
    fig = make_subplots(rows=len(years),cols=1,shared_xaxes=True,vertical_spacing=0.005)
    r=1
    for year in years:
        data = cal_data[cal_data['year']==year]
        data = introduce_nulls(data)
        fig.add_trace(go.Heatmap(
            z=data.sold,
            x=data.week,
            y=data.day_name,
            hovertext=data.date.dt.date,
            coloraxis = "coloraxis",name=year,
        ),r,1)
        fig.update_yaxes(title_text=year,tickfont=dict(size=5),row = r,col = 1)
        r+=1
    fig.update_xaxes(range=[1,53],tickfont=dict(size=10), nticks=53)
    fig.update_layout(coloraxis = {'colorscale':scale})
    fig.update_layout(template='seaborn', title=store)
    return fig

California


CA_1#

fig = plot_metric(df,'CA','CA_1','sold')
fig.show()
fig = plot_metric(df,'CA','CA_1','revenue')
fig.show()
fig = calmap(cal_data, 'CA', 'CA_1', 'magma')
fig.show()

CA_2#


fig = plot_metric(df,'CA','CA_2','sold')
fig.show()
fig = plot_metric(df,'CA','CA_2','revenue')
fig.show()
fig = calmap(cal_data, 'CA', 'CA_2', 'magma')
fig.show()

CA_3#


fig = plot_metric(df,'CA','CA_3','sold')
fig.show()
fig = plot_metric(df,'CA','CA_3','revenue')
fig.show()
fig = calmap(cal_data, 'CA', 'CA_3', 'magma')
fig.show()

CA_4#


fig = plot_metric(df,'CA','CA_4','sold')
fig.show()
fig = plot_metric(df,'CA','CA_4','revenue')
fig.show()
fig = calmap(cal_data, 'CA', 'CA_4', 'magma')
fig.show()

Texas

TX_1#

fig = plot_metric(df,'TX','TX_1','sold')
fig.show()
fig = plot_metric(df,'TX','TX_1','revenue')
fig.show()
fig = calmap(cal_data, 'TX', 'TX_1', 'viridis')
fig.show()

TX_2#


fig = plot_metric(df,'TX','TX_2','sold')
fig.show()
fig = plot_metric(df,'TX','TX_2','revenue')
fig.show()
fig = calmap(cal_data, 'TX', 'TX_2', 'viridis')
fig.show()

TX_3#


fig = plot_metric(df,'TX','TX_3','sold')
fig.show()
fig = plot_metric(df,'TX','TX_3','revenue')
fig.show()
fig = calmap(cal_data, 'TX', 'TX_3', 'viridis')
fig.show()

Wisconsin

WI_1#

fig = plot_metric(df,'WI','WI_1','sold')
fig.show()
fig = plot_metric(df,'WI','WI_1','revenue')
fig.show()
fig = calmap(cal_data, 'WI', 'WI_1', 'twilight')
fig.show()

WI_2#


fig = plot_metric(df,'WI','WI_2','sold')
fig.show()
fig = plot_metric(df,'WI','WI_2','revenue')
fig.show()
fig = calmap(cal_data, 'WI', 'WI_2', 'twilight')
fig.show()

WI_3#


fig = plot_metric(df,'WI','WI_3','sold')
fig.show()
fig = plot_metric(df,'WI','WI_3','revenue')
fig.show()
fig = calmap(cal_data, 'WI', 'WI_3', 'twilight')
fig.show()
df = df.drop('revenue', axis=1)

5. Feature Engineering#

Time Series data must be re-framed as a supervised learning dataset before we can start using machine learning algorithms.

There is no concept of input and output features in time series. Instead, we must choose the variable to be predicted and use feature engineering to construct all of the inputs that will be used to make predictions for future time steps.

The goal of feature engineering is to provide strong and ideally simple relationships between new input features and the output feature for the supervised learning algorithm to model.
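A minimal sketch of this reframing on a toy series (hypothetical values): each row pairs lagged observations, which become the input features, with the value to predict.

```python
import pandas as pd

# Toy univariate series of daily sales
series = pd.DataFrame({'sold': [10, 12, 9, 14, 11, 13]})

# Reframe as supervised learning: predict sold(t) from sold(t-1) and sold(t-2)
supervised = pd.DataFrame({
    'sold_lag_2': series['sold'].shift(2),
    'sold_lag_1': series['sold'].shift(1),
    'target': series['sold'],
}).dropna()  # the first rows have no complete lag history
print(supervised)
#    sold_lag_2  sold_lag_1  target
# 2        10.0        12.0       9
# 3        12.0         9.0      14
# 4         9.0        14.0      11
# 5        14.0        11.0      13
```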


5.1 Label Encoding

  1. Remove unwanted data to create space in RAM for further processing.
  2. Label encode the categorical features. (I had already converted the categorical variables to the category dtype, so I can simply use their codes instead of LabelEncoder.)
  3. Remove date as its features are already present.
#Store the categories along with their codes
d_id = dict(zip(df.id.cat.codes, df.id))
d_item_id = dict(zip(df.item_id.cat.codes, df.item_id))
d_dept_id = dict(zip(df.dept_id.cat.codes, df.dept_id))
d_cat_id = dict(zip(df.cat_id.cat.codes, df.cat_id))
d_store_id = dict(zip(df.store_id.cat.codes, df.store_id))
d_state_id = dict(zip(df.state_id.cat.codes, df.state_id))
#1
del group, group_price_cat, group_price_store, group_state, group_state_store, cal_data
gc.collect();

#2
df.d = df['d'].apply(lambda x: x.split('_')[1]).astype(np.int16)
cols = df.dtypes.index.tolist()
types = df.dtypes.values.tolist()
for i,t in enumerate(types):
    if t.name == 'category':
        df[cols[i]] = df[cols[i]].cat.codes
        
#3
df.drop('date',axis=1,inplace=True)

5.2 Introduce Lags


Lag features are the classical way that time series forecasting problems are transformed into supervised learning problems.

Introduce lags to the target variable sold. The maximum lag I have introduced is 36 days; it's entirely up to you how many lags you introduce.

#Introduce lags
lags = [1,2,3,6,12,24,36]
for lag in lags:
    df['sold_lag_'+str(lag)] = df.groupby(['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'],as_index=False)['sold'].shift(lag).astype(np.float16)

5.3 Mean Encoding


From a mathematical point of view, mean encoding represents the probability of your target variable conditional on each value of the feature; in a way, it embodies the target variable in its encoded value. I have calculated mean encodings for the following feature combinations:

  • item

  • state

  • store

  • category

  • department

  • category & department

  • store & item

  • category & item

  • department & item

  • state & store

  • state, store and category

  • store, category and department

Feel free to add more!
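On a toy frame (hypothetical stores and values), the transform('mean') pattern used in the cells below looks like this: every row receives the mean of the target within its group.

```python
import pandas as pd

# Toy frame: mean-encode 'store' against the target 'sold'
toy = pd.DataFrame({
    'store': ['CA_1', 'CA_1', 'TX_1', 'TX_1', 'TX_1'],
    'sold':  [2, 4, 1, 3, 8],
})
# transform broadcasts each group's mean back to the original rows
toy['store_sold_avg'] = toy.groupby('store')['sold'].transform('mean')
print(toy)
#   store  sold  store_sold_avg
# 0  CA_1     2             3.0
# 1  CA_1     4             3.0
# 2  TX_1     1             4.0
# 3  TX_1     3             4.0
# 4  TX_1     8             4.0
```

One caveat worth noting: computed over all days at once, the encoding also sees validation and test rows, which leaks target information; computing it on training rows only would be the safer variant.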

df['iteam_sold_avg'] = df.groupby('item_id')['sold'].transform('mean').astype(np.float16)
df['state_sold_avg'] = df.groupby('state_id')['sold'].transform('mean').astype(np.float16)
df['store_sold_avg'] = df.groupby('store_id')['sold'].transform('mean').astype(np.float16)
df['cat_sold_avg'] = df.groupby('cat_id')['sold'].transform('mean').astype(np.float16)
df['dept_sold_avg'] = df.groupby('dept_id')['sold'].transform('mean').astype(np.float16)
df['cat_dept_sold_avg'] = df.groupby(['cat_id','dept_id'])['sold'].transform('mean').astype(np.float16)
df['store_item_sold_avg'] = df.groupby(['store_id','item_id'])['sold'].transform('mean').astype(np.float16)
df['cat_item_sold_avg'] = df.groupby(['cat_id','item_id'])['sold'].transform('mean').astype(np.float16)
df['dept_item_sold_avg'] = df.groupby(['dept_id','item_id'])['sold'].transform('mean').astype(np.float16)
df['state_store_sold_avg'] = df.groupby(['state_id','store_id'])['sold'].transform('mean').astype(np.float16)
df['state_store_cat_sold_avg'] = df.groupby(['state_id','store_id','cat_id'])['sold'].transform('mean').astype(np.float16)
df['store_cat_dept_sold_avg'] = df.groupby(['store_id','cat_id','dept_id'])['sold'].transform('mean').astype(np.float16)

5.4 Rolling Window Statistics


This method is called the rolling window method because a fixed-size window slides forward over the series, computing a statistic over the values inside it at each step, so the window is different for every data point.

I’ll be calculating the weekly rolling average of the items sold. More features like rolling min, max or sum can also be calculated, and the same features can be computed for revenue as well.
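A toy illustration (hypothetical values) of the 7-day rolling mean: shifting by one day first means day t's feature only uses days t-7 through t-1, so the feature never peeks at the value it is meant to predict.

```python
import pandas as pd

# Ten days of toy sales
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# shift(1) first, then a 7-day rolling mean over the shifted values
rolling_7 = s.shift(1).rolling(window=7).mean()
print(rolling_7.tolist())
# [nan, nan, nan, nan, nan, nan, nan, 4.0, 5.0, 6.0]
```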

#Shift inside the per-group transform so the rolling window stays within each series
df['rolling_sold_mean'] = df.groupby(['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'])['sold'].transform(lambda x: x.shift(1).rolling(window=7).mean()).astype(np.float16)

5.5 Expanding Window Statistics


This is simply an advanced version of the rolling window technique. In the case of a rolling window, the size of the window is constant while the window slides forward in time, so we consider only the most recent values and ignore older ones. An expanding window, by contrast, grows at each step and includes all past values.

I’ll be calculating the expanding average of the items sold. More features like expanding min, max or sum can also be calculated, and the same features can be computed for revenue as well.
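The same toy illustration for the expanding mean (hypothetical values): after the one-day shift, each position averages all previous days, and min_periods=2 leaves the first positions empty until two observations are available.

```python
import pandas as pd

# Five days of toy sales
s = pd.Series([2, 4, 6, 8, 10])

# shift(1) first, then an expanding mean over all prior days
expanding_mean = s.shift(1).expanding(2).mean()
print(expanding_mean.tolist())
# [nan, nan, 3.0, 4.0, 5.0]
```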

#Shift inside the per-group transform so the expanding window stays within each series
df['expanding_sold_mean'] = df.groupby(['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'])['sold'].transform(lambda x: x.shift(1).expanding(2).mean()).astype(np.float16)

5.6 Trends


I will create a selling trend feature: positive when the daily items sold exceed the average over the entire duration (d_1 - d_1969), negative otherwise. More trend features could be added, but I'll only add this one to keep it simple.

df['daily_avg_sold'] = df.groupby(['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id','d'])['sold'].transform('mean').astype(np.float16)
df['avg_sold'] = df.groupby(['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'])['sold'].transform('mean').astype(np.float16)
df['selling_trend'] = (df['daily_avg_sold'] - df['avg_sold']).astype(np.float16)
df.drop(['daily_avg_sold','avg_sold'],axis=1,inplace=True)

5.7 Save the data


Now that all the new features are created, let's save the data so that training can be done separately. Also, lags introduce a lot of null values, so I'll remove the data for the first 35 days, since I have introduced lags of up to 36 days.

df = df[df['d']>=36]

Let’s look at our new dataframe.

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 58967660 entries, 1067150 to 60034809
Data columns (total 43 columns):
 #   Column                    Dtype  
---  ------                    -----  
 0   id                        int16  
 1   item_id                   int16  
 2   dept_id                   int8   
 3   cat_id                    int8   
 4   store_id                  int8   
 5   state_id                  int8   
 6   d                         int16  
 7   sold                      int16  
 8   wm_yr_wk                  int16  
 9   weekday                   int8   
 10  wday                      int8   
 11  month                     int8   
 12  year                      int16  
 13  event_name_1              int8   
 14  event_type_1              int8   
 15  event_name_2              int8   
 16  event_type_2              int8   
 17  snap_CA                   int8   
 18  snap_TX                   int8   
 19  snap_WI                   int8   
 20  sell_price                float16
 21  sold_lag_1                float16
 22  sold_lag_2                float16
 23  sold_lag_3                float16
 24  sold_lag_6                float16
 25  sold_lag_12               float16
 26  sold_lag_24               float16
 27  sold_lag_36               float16
 28  iteam_sold_avg            float16
 29  state_sold_avg            float16
 30  store_sold_avg            float16
 31  cat_sold_avg              float16
 32  dept_sold_avg             float16
 33  cat_dept_sold_avg         float16
 34  store_item_sold_avg       float16
 35  cat_item_sold_avg         float16
 36  dept_item_sold_avg        float16
 37  state_store_sold_avg      float16
 38  state_store_cat_sold_avg  float16
 39  store_cat_dept_sold_avg   float16
 40  rolling_sold_mean         float16
 41  expanding_sold_mean       float16
 42  selling_trend             float16
dtypes: float16(23), int16(6), int8(14)
memory usage: 4.4 GB

Save the data for training.

df.to_pickle('data.pkl')
del df
gc.collect();

6. Modelling and Prediction#

data = pd.read_pickle('data.pkl')
valid = data[(data['d']>=1914) & (data['d']<1942)][['id','d','sold']]
test = data[data['d']>=1942][['id','d','sold']]
eval_preds = test['sold'].copy()
valid_preds = valid['sold'].copy()
#Get the store ids

model_list = []
stores = sales.store_id.cat.codes.unique().tolist()
for store in stores:
    df = data[data['store_id']==store]
    
    #Split the data
    X_train, y_train = df[df['d']<1914].drop('sold',axis=1), df[df['d']<1914]['sold']
    X_valid, y_valid = df[(df['d']>=1914) & (df['d']<1942)].drop('sold',axis=1), df[(df['d']>=1914) & (df['d']<1942)]['sold']
    X_test = df[df['d']>=1942].drop('sold',axis=1)
    
    #Train and validate
    model = LGBMRegressor(
        n_estimators=1000,
        learning_rate=0.3,
        subsample=0.8,
        colsample_bytree=0.8,
        max_depth=8,
        num_leaves=50,
        min_child_weight=300
    )
    print('*****Prediction for Store: {}*****'.format(d_store_id[store]))
    model.fit(X_train, y_train, eval_set=[(X_train,y_train),(X_valid,y_valid)],
             eval_metric='rmse', verbose=20, early_stopping_rounds=20)
    valid_preds[X_valid.index] = model.predict(X_valid)
    eval_preds[X_test.index] = model.predict(X_test)
    
    model_list.append(model)
    
    # Clear memory
    del model, X_train, y_train, X_valid, y_valid
    gc.collect()
*****Prediction for Store: CA_1*****
[20]	training's rmse: 0.904045	training's l2: 0.817298	valid_1's rmse: 0.568944	valid_1's l2: 0.323697
[40]	training's rmse: 0.889724	training's l2: 0.791609	valid_1's rmse: 0.567389	valid_1's l2: 0.32193
[60]	training's rmse: 0.880293	training's l2: 0.774915	valid_1's rmse: 0.564904	valid_1's l2: 0.319116
[80]	training's rmse: 0.872382	training's l2: 0.761049	valid_1's rmse: 0.563527	valid_1's l2: 0.317562
*****Prediction for Store: CA_2*****
[20]	training's rmse: 0.580716	training's l2: 0.337231	valid_1's rmse: 0.53875	valid_1's l2: 0.290251
[40]	training's rmse: 0.56762	training's l2: 0.322193	valid_1's rmse: 0.531685	valid_1's l2: 0.282689
[60]	training's rmse: 0.563519	training's l2: 0.317554	valid_1's rmse: 0.52741	valid_1's l2: 0.278161
[80]	training's rmse: 0.560069	training's l2: 0.313677	valid_1's rmse: 0.523966	valid_1's l2: 0.274541
[100]	training's rmse: 0.55695	training's l2: 0.310193	valid_1's rmse: 0.522462	valid_1's l2: 0.272967
[120]	training's rmse: 0.554513	training's l2: 0.307484	valid_1's rmse: 0.521643	valid_1's l2: 0.272111
[140]	training's rmse: 0.552004	training's l2: 0.304708	valid_1's rmse: 0.523552	valid_1's l2: 0.274107
*****Prediction for Store: CA_3*****
[20]	training's rmse: 1.3662	training's l2: 1.86649	valid_1's rmse: 0.715061	valid_1's l2: 0.511312
[40]	training's rmse: 1.33013	training's l2: 1.76924	valid_1's rmse: 0.701512	valid_1's l2: 0.492119
[60]	training's rmse: 1.30702	training's l2: 1.70831	valid_1's rmse: 0.69533	valid_1's l2: 0.483484
[80]	training's rmse: 1.28812	training's l2: 1.65926	valid_1's rmse: 0.689454	valid_1's l2: 0.475347
[100]	training's rmse: 1.2693	training's l2: 1.61113	valid_1's rmse: 0.67785	valid_1's l2: 0.459481
[120]	training's rmse: 1.25489	training's l2: 1.57474	valid_1's rmse: 0.681503	valid_1's l2: 0.464446
*****Prediction for Store: CA_4*****
[20]	training's rmse: 0.422804	training's l2: 0.178763	valid_1's rmse: 0.348529	valid_1's l2: 0.121472
[40]	training's rmse: 0.416184	training's l2: 0.173209	valid_1's rmse: 0.342484	valid_1's l2: 0.117295
[60]	training's rmse: 0.41293	training's l2: 0.170511	valid_1's rmse: 0.340811	valid_1's l2: 0.116152
[80]	training's rmse: 0.411131	training's l2: 0.169029	valid_1's rmse: 0.340461	valid_1's l2: 0.115914
*****Prediction for Store: TX_1*****
[20]	training's rmse: 0.852924	training's l2: 0.727479	valid_1's rmse: 0.52504	valid_1's l2: 0.275667
[40]	training's rmse: 0.835764	training's l2: 0.698501	valid_1's rmse: 0.51213	valid_1's l2: 0.262277
[60]	training's rmse: 0.827983	training's l2: 0.685556	valid_1's rmse: 0.509391	valid_1's l2: 0.259479
[80]	training's rmse: 0.821042	training's l2: 0.674111	valid_1's rmse: 0.507065	valid_1's l2: 0.257115
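`joblib` is imported at the top of the notebook but never used; one natural use is persisting the per-store models so the training loop need not be re-run. A minimal sketch, where the file name is illustrative and a plain dict stands in for a fitted `LGBMRegressor` from `model_list`:

```python
import os
import tempfile

import joblib

# Stand-in for one trained per-store model; in the notebook this would be
# an LGBMRegressor taken from model_list.
model = {'store': 'CA_1', 'n_estimators': 1000}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'lgbm_store_CA_1.pkl')  # illustrative file name
    joblib.dump(model, path)      # serialize the fitted model to disk
    restored = joblib.load(path)  # later: reload without retraining

assert restored == model
```

In the loop above, `joblib.dump(model, f'lgbm_store_{d_store_id[store]}.pkl')` just before `del model` would save each store's model as it finishes training.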

Plotting feature importances

feature_importance_df = pd.DataFrame()
features = [f for f in data.columns if f != 'sold']
for model, store in zip(model_list, stores):
    store_importance_df = pd.DataFrame()
    store_importance_df["feature"] = features
    store_importance_df["importance"] = model.feature_importances_
    store_importance_df["store"] = store
    feature_importance_df = pd.concat([feature_importance_df, store_importance_df], axis=0)

def display_importances(feature_importance_df_):
    cols = feature_importance_df_[["feature", "importance"]].groupby("feature").mean().sort_values(by="importance", ascending=False)[:20].index
    best_features = feature_importance_df_.loc[feature_importance_df_.feature.isin(cols)]
    plt.figure(figsize=(8, 10))
    sns.barplot(x="importance", y="feature", data=best_features.sort_values(by="importance", ascending=False))
    plt.title('LightGBM Features (averaged over store predictions)')
    plt.tight_layout()
    
display_importances(feature_importance_df)

Make the submission

If you remember, for EDA, feature engineering and training I melted the provided data from wide format to long format. The predictions are therefore in long format, but the competition expects submissions in wide format, so I'll convert them back using the pivot function in pandas.
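As a minimal, self-contained illustration of that long-to-wide pivot (toy ids and day labels, not the real data):

```python
import pandas as pd

# Toy long-format predictions: one row per (id, day)
long_df = pd.DataFrame({
    'id':   ['A', 'A', 'B', 'B'],
    'd':    [1942, 1943, 1942, 1943],
    'sold': [3.0, 4.0, 1.0, 2.0],
})

# Pivot to wide format: one row per id, one F-column per forecast day
wide = pd.pivot(long_df, index='id', columns='d', values='sold').reset_index()
wide.columns = ['id'] + ['F' + str(i + 1) for i in range(2)]

assert list(wide.columns) == ['id', 'F1', 'F2']
assert wide.loc[wide['id'] == 'A', 'F2'].iloc[0] == 4.0
```

The real submission does exactly this, with 28 `F` columns instead of 2.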

#Set actual equal to false if you want to top in the public leaderboard :P
actual = False
if actual == False:
    #Use the ground-truth validation results (the actuals for d_1914-d_1941 have already been released)
    validation = sales[['id']+['d_' + str(i) for i in range(1914,1942)]]
    validation['id']=pd.read_csv('/kaggle/input/m5-forecasting-accuracy/sales_train_validation.csv').id
    validation.columns=['id'] + ['F' + str(i + 1) for i in range(28)]
else:
    #Use the model's validation predictions
    valid['sold'] = valid_preds
    validation = valid[['id','d','sold']]
    validation = pd.pivot(validation, index='id', columns='d', values='sold').reset_index()
    validation.columns=['id'] + ['F' + str(i + 1) for i in range(28)]
    validation.id = validation.id.map(d_id).str.replace('evaluation','validation')

#Get the evaluation results
test['sold'] = eval_preds
evaluation = test[['id','d','sold']]
evaluation = pd.pivot(evaluation, index='id', columns='d', values='sold').reset_index()
evaluation.columns=['id'] + ['F' + str(i + 1) for i in range(28)]
#Remap the category id to their respective categories
evaluation.id = evaluation.id.map(d_id)

#Prepare the submission
submit = pd.concat([validation,evaluation]).reset_index(drop=True)
submit.to_csv('submission.csv',index=False)
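Before uploading, a quick shape check helps: an M5 submission should have 60,980 rows (30,490 validation ids plus 30,490 evaluation ids) and the columns `id`, `F1`, ..., `F28`. A minimal sketch (the helper name is my own, not part of the competition API):

```python
import pandas as pd

def check_submission(submit: pd.DataFrame, n_series: int = 30490) -> list:
    """Return a list of problems with an M5-style submission (empty list = looks OK)."""
    problems = []
    expected_cols = ['id'] + ['F' + str(i + 1) for i in range(28)]
    if list(submit.columns) != expected_cols:
        problems.append('columns should be id, F1..F28')
        return problems  # later checks assume the column layout is right
    if len(submit) != 2 * n_series:  # validation + evaluation rows
        problems.append('expected {} rows, got {}'.format(2 * n_series, len(submit)))
    if submit[expected_cols[1:]].isna().any().any():
        problems.append('missing predictions')
    return problems

# Toy frame with two series instead of 30490, just to exercise the checks
cols = ['id'] + ['F' + str(i + 1) for i in range(28)]
toy = pd.DataFrame([['id_' + str(j)] + [0.0] * 28 for j in range(4)], columns=cols)
assert check_submission(toy, n_series=2) == []
```

Calling `check_submission(submit)` just before `submit.to_csv(...)` should return an empty list.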

Resources#

This demo was adapted from the following Kaggle notebook.